Masthead

Lab: Categorical And Regression Trees (CART)

Note for Jim for next year: change Xs and Ys to Covariate1 and Covariate2.

This lab will introduce you to modeling using Categorical and Regression Trees (CART). These trees can model continuous or categorical data but are particularly well suited for categorical response data. In R there are two CART libraries, tree and rpart. Rpart is the newer of the two and is used by Plant so we'll use it.

Note that for this lab, you'll need the rpart and the sp (spatial) libraries and use na.omit() to remove invalid values or your prediction will have a different number of rows than your original data.

Exploring Together

Functions for Synthetic Data

The code below contains a series of functions that we'll be using to create synthetic data sets in R. Copy this code into R Studio and compile it to create the functions.

library(rpart)
library(sp)
######################################################################################
# Creates a two dimensional pattern of data increasing from the lower left
# to the upper right
######################################################################################

Linear2D=function(NumEntries=100)
{
  Ys=as.vector(array(1:NumEntries))
  Xs=as.vector(array(1:NumEntries))
  Measures=as.vector(array(1:NumEntries))
  for (Index in 1:NumEntries)
  {
    Xs[Index]=runif(1,0,100)
    Ys[Index]=runif(1,0,100)
    Measures[Index]=Xs[Index]+Ys[Index]
  }
  TheDataFrame = data.frame(Ys, Xs, Measures)
}

######################################################################################
# Creates a 2 dimensional data set with a single split to the data in the X direction.
######################################################################################

Categories1D=function(NumEntries=100)
{
  Ys=as.vector(array(1:NumEntries))
  Xs=as.vector(array(1:NumEntries))
  Measures=as.vector(array(1:NumEntries))
  for (Index in 1:NumEntries)
  {
    Xs[Index]=runif(1,0,100)
    Ys[Index]=runif(1,0,100)
    if (Xs[Index]>50) Measures[Index]=1
    else Measures[Index]=2
  }
  TheDataFrame = data.frame(Ys, Xs, Measures)
}
######################################################################################
# Creates a 2 dimensional data set with two splits one in the x and one in the y 
# direction.
######################################################################################

Categories2D=function(NumEntries=100)
{
  Ys=as.vector(array(1:NumEntries))
  Xs=as.vector(array(1:NumEntries))
  Measures=as.vector(array(1:NumEntries))
  for (Index in 1:NumEntries)
  {
    Xs[Index]=runif(1,0,100)
    Ys[Index]=runif(1,0,100)
    if (Xs[Index]>50) 
    {
      if (Ys[Index]>50) Measures[Index]=2
      else Measures[Index]=1
    }
    else
    {
      if (Ys[Index]<50) Measures[Index]=2
      else Measures[Index]=1
    }
    
  }
  TheDataFrame = data.frame(Ys, Xs, Measures)
} 

Creating Trees

The code below will run one of the functions above and then create a bubble chart to show the distribution of the data. Remember that this is in a two-dimensional predictor space.

Note that we need to create a copy of the data before converting it to a "SpatialDataFrame". This will allow us to use the variable "TheDataFrame" in analysis in the next step.

TheDataFrame=Categories1D(100) # creates two areas with different values 
SpatialDataFrame=TheDataFrame # make a copy
coordinates(SpatialDataFrame)= ~ Xs+Ys # converts to a SpatialPointDataFrame
bubble(SpatialDataFrame, zcol='Measures', fill=TRUE, do.sqrt=FALSE, maxsize=3,scales = list(draw = TRUE), xlab = "Xs", ylab = "Ys")

The code below will create a tree and output the results. Run this for the first of the synthetic tree functions. Then, take a look at the results and see what you think.

TheDataFrame=na.omit(TheDataFrame) # remove any blanks
TheTree=rpart(Measures~Xs+Ys,data=TheDataFrame,method="anova") # create the tree

par(mar=c(2,2,1,1)) # plot margins: bottom,left, top,right
plot(TheTree) # plot the tree (without labels)
text(TheTree) # add the labels to the branches
print(TheTree) # print the equations for the tree
summary(TheTree) 

Printing the model will produce the following results. Note that in the rpart() function above, we use "anova". Because of this the deviance will simply be the sum of the square of the differences between our model's predicted values and our data values.

n= 100 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 100 162658.6000  91.22948  
   2) Ys< 39.97511 48  43603.5900  63.60367  
     4) Xs< 37.85148 22   4778.8340  37.52148  
       8) Xs< 27.23887 15   1611.7370  30.71262 *
       9) Xs>=27.23887 7    981.5252  52.11190 *
     5) Xs>=37.85148 26  11194.8900  85.67321  
      10) Xs< 64.17446 15   2253.3050  72.29593 *
      11) Xs>=64.17446 11   2596.9370 103.91500 *
   3) Ys>=39.97511 52  48607.2000 116.73020  
     6) Xs< 57.2601 31  13012.0500  97.52345  
      12) Xs< 26.92681 15   4944.0230  82.21864 *
      13) Xs>=26.92681 16   1260.5070 111.87170 *
     7) Xs>=57.2601 21   7277.6820 145.08310  
      14) Ys< 76.65642 14   2322.3320 135.85410 *
      15) Ys>=76.65642 7   1378.0720 163.54090 * 

Note that this is a "code" version of the tree. It starts with the "root" of the tree at node 1. Node 2 shows a split (to the left) when the Y value is less than 39.97511 (Ys< 39.97511). Note that the other side of this split is at node 3 and is for Ys>=39.97511. The tree continues until we have terminal nodes. For each line of this version of the tree, there is also the number of data points at the node, the deviance explained at that node, and the response value (yval) for the node.

Now, we can take a look at the residuals.

TheResiduals=residuals(TheTree) # gets the residuals for our training data
plot(TheResiduals)
summary(TheResiduals)

Predictions

Now, lets create a prediction of our tree to see the results.

ThePrediction=predict(TheTree) # predicts the original data based on the model
par(mar=c(4,4,4,4))
plot(ThePrediction, TheResiduals) 

We can view the predicted output with another bubble chart.

Xs=TheDataFrame$Xs # get the x values
Ys=TheDataFrame$Ys # get the y values

TheDataFrame2 = data.frame(Ys,Xs, ThePrediction) # create a dataframe with the predictions
SpatialDataFrame2=TheDataFrame2 # makes a copy
coordinates(SpatialDataFrame2)= ~ Xs+Ys # converts to a SpatialPointDataFrame
bubble(SpatialDataFrame2, zcol='ThePrediction', fill=TRUE, do.sqrt=FALSE, maxsize=3,scales = list(draw = TRUE), xlab = "Xs", ylab = "Ys") 

Try the procedure above for data created from the Categories2D() and Linear2D() functions

TheDataFrame=Categories2D(100) # creates four areas with different values
TheDataFrame=Linear2D(100) # creates a 2 dimentional linear surface

Optomizing the Tree

Now, lets try using "rpart.control()" to create new parameters for some of our data. The rpart.control() function creates an object we can pass into rpart() to change the complexity of the trees that are fitted to the data.

To find out about the complexity parameter, it is time for us to start reading the standard R documentation. Search on the web for "R rpart" and then find the standard PDF for this library.

TheControl=rpart.control(minsplit=5,cp=0.2) #changes the 
TheTree=rpart(Measures~Xs+Ys,data=TheDataFrame,method="anova",control=TheControl)
par(mar=c(1,1,1,1)) # margins: bottom,left, top,right
plot(TheTree)
text(TheTree)


Creating Predictive Maps for Continuous Variables and Point/CSV Files

We now have the tools we need to create predictive maps. The process below can be used to create maps using a wide variety of modeling approaches including linear models, higher-order polynomial models, tree-based models (CART), Generalized Linear Models (GLMs), and Generalized Additive Models (GAMs.). We'll use this approach in the next three labs to create models with different methods.

  1. Create a CSV file with your sample data. This file should contain your response variables and matching predictor variables. There are a variety of ways to create this file but typically you'll:
    1. Obtain a shapefile of points and measured variables.
    2. Extract the values from your predictors for your points.
    3. Export the files as a CSV (BlueSpray) or save it as a shapefile, open the DBF in Excel, and then save it out as a CSV.
  2. Create your model. Use the methods above to create the tree-based model.
  3. Create your predictive data. This is a CSV that represents your entire sample area and contains columns for the x and y coordinate values and your predictor variable values for those points.
    1. Start with one of your predictor layers and convert it to a point layer ("To Points" in BlueSpray)
    2. In BlueSpray:
      1. Right-click on the point layer and select the attribute table
      2. Right-click on a column heading in the attribute table and select "Insert Column on Right"
      3. Give the new attribute a good name and the type of "Double". Click "OK"
      4. Right click on the attribute heading and select "Extract"
      5. Select the second predictor layer and select OK. This will add the values from the predictor that contain the specified pixel.
      6. Repeat steps 2 through 5 for each predictor layer
      7. Right click on the layer and select "Export to File".
      8. Give the new file a good name and end it with ".csv" to right out a CSV file.
    3. Open the CSV in Excel and change the file's column headers to exactly match the one's you used in R to create the model.
  4. Predict values for the entire sample area. This step adds predictor values to the CSV file you just created.
    1. Use code similar to the code below to create new predictor values in R.
      ThePrediction=predict(TheTree,NewData) # create a vector of predicted values
      
      FinalData=data.frame(NewData,ThePrediction) # combine the new data and the predicted values
      
      write.csv(FinalData, "C:\\Temp\\ModelOutput.csv")  # Save the prediction to a new file
    2. Load the code into a GIS application and then symbolize the points to see the model!
    3. You may need to change the file's column headers for the GIS app to use the headers.
  5. Create a predicted map. The last step is to convert the points into a raster.
    1. In BlueSpray:
      1. Right click on the point layer and select "Transforms -> To Raster"
      2. The only tricky part is picking the correct resolution for the raster to make sure the pixels match the spacing of the points. The resolution should be the same as the original predictor rasters.

Creating Predicted Maps for Categorical Values from Polygon Shapefiles

The process for creating apredicted map for categorical variables from a shapefile is very similar to point files with continuous variables as above until you get to the prediction portion. For categorical varaibles, you'll need to add type="class":

ThePrediction = predict(TheTree, type = "class") 

Then, you can add this back into the data frame and write out the file:

Parcel2$Predict=ThePrediction

write.table(Parcel2, file = "C:/Temp/Parcels.csv", sep = ",")

In ArcMap, you can then Join the column with the prediction to your original file:

Note that the join will only work if the columns are compatible types. For the Parcel layer, I had to convert the APN column to a new column with Long Integer as the type because there were some APNs with invalid values.

On Your Own

Follow the steps for modeling data including:

Refer to the R for Spatial Statistics web site and the section on CART under Section 7, "Correlation/Regression Modeling" and section 8.3 from Plant's book.

Note: You are welcome to use content from previous labs to complete this lab.

 

© Copyright 2018 HSU - All rights reserved.